One of the important yet tedious tasks in agricultural practice is the detection of crop diseases, which demands enormous time and skilled labor. This paper proposes a smart and efficient approach to crop disease detection using computer vision and machine learning techniques. The proposed system can detect 20 different diseases of 5 common plants with 93% accuracy.
In this paper we explore the task of modeling (semi) structured object sequences; in particular, we focus our attention on the problem of developing a structure-aware input representation for such sequences. In such sequences, we assume that each structured object is represented by a set of key-value pairs which encode the attributes of the structured object. Given a universe of keys, a sequence of structured objects can then be viewed as an evolution of the values for each key, over time. We encode and construct a sequential representation using the values for a particular key (Temporal Value Modeling - TVM) and then self-attend over the set of key-conditioned value sequences to create a representation of the structured object sequence (Key Aggregation - KA). We pre-train and fine-tune the two components independently and present an innovative training schedule that interleaves the training of both modules with shared attention heads. We find that this iterative two-part training results in better performance than a unified network with hierarchical encoding, as well as other methods that use a {\em record-view} representation of the sequence \cite{de2021transformers4rec} or a simple {\em flattened} representation of the sequence. We conduct experiments using real-world data to demonstrate the advantage of interleaving TVM-KA on multiple tasks, and present detailed ablation studies motivating our modeling choices. We find that our approach performs better than flattening sequence objects and also allows us to operate on significantly larger sequences than existing methods.
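To make the TVM-KA decomposition concrete, here is a minimal PyTorch sketch of the two components; all module names and hyperparameters are illustrative assumptions, not the authors' implementation. A shared encoder models each key's value sequence over time (TVM), and a second encoder self-attends across the per-key summaries (KA).

```python
# Minimal sketch of the TVM-KA idea (PyTorch); names are hypothetical.
import torch
import torch.nn as nn

class TVMKA(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=2, n_keys=16):
        super().__init__()
        self.value_emb = nn.Embedding(vocab_size, d_model)
        # Temporal Value Modeling: one shared encoder applied per key,
        # over that key's value sequence through time.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.tvm = nn.TransformerEncoder(layer, n_layers)
        # Key Aggregation: self-attention across the per-key summaries.
        self.ka = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), 1)
        self.key_emb = nn.Embedding(n_keys, d_model)

    def forward(self, values):                       # values: (batch, n_keys, seq_len)
        b, k, t = values.shape
        x = self.value_emb(values).view(b * k, t, -1)
        x = self.tvm(x).mean(dim=1).view(b, k, -1)   # one summary vector per key
        x = x + self.key_emb.weight                  # condition on key identity
        return self.ka(x).mean(dim=1)                # sequence-level representation
```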
Diabetic Retinopathy (DR) is considered one of the primary concerns due to its effect on vision loss among most people with diabetes globally. The severity of DR is mostly assessed manually by ophthalmologists from fundus photography-based retina images. This paper deals with an automated understanding of the severity stages of DR. In the literature, researchers have approached this automation using traditional machine learning-based algorithms and convolutional architectures. However, past works have hardly focused on the essential parts of the retinal image that improve model performance. In this paper, we adopt transformer-based learning models to capture the crucial features of retinal images to understand DR severity better. We work with ensembles of image transformers, adopting four models, namely ViT (Vision Transformer), BEiT (Bidirectional Encoder representation for image Transformer), CaiT (Class-Attention in Image Transformers), and DeiT (Data-efficient image Transformers), to infer the degree of DR severity from fundus photographs. For experiments, we used the publicly available APTOS-2019 blindness detection dataset, where the performances of the transformer-based models were quite encouraging.
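As a rough illustration of the ensembling setup, the sketch below instantiates the four transformer families via the timm library and averages their predicted probabilities. The exact model variants, checkpoints, and fusion rule are assumptions, since the abstract does not specify them.

```python
# Hedged sketch of a four-transformer ensemble for DR grading, using timm.
# The specific checkpoints and probability-averaging fusion are assumptions.
import timm
import torch

NUM_CLASSES = 5  # APTOS-2019 grades DR severity on a 0-4 scale

names = ["vit_base_patch16_224", "beit_base_patch16_224",
         "cait_s24_224", "deit_base_patch16_224"]
models = [timm.create_model(n, pretrained=True, num_classes=NUM_CLASSES).eval()
          for n in names]

@torch.no_grad()
def ensemble_predict(batch):                      # batch: (B, 3, 224, 224)
    # Simple probability averaging; the paper may use a different fusion rule.
    probs = torch.stack([m(batch).softmax(dim=-1) for m in models])
    return probs.mean(dim=0).argmax(dim=-1)       # predicted severity grade
```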
Spatial understanding is a fundamental aspect of computer vision and integral for human-level reasoning about images, making it an important component for grounded language understanding. While recent large-scale text-to-image synthesis (T2I) models have shown unprecedented improvements in photorealism, it is unclear whether they have reliable spatial understanding capabilities. We investigate the ability of T2I models to generate correct spatial relationships among objects and present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image. To benchmark existing models, we introduce a large-scale challenge dataset SR2D that contains sentences describing two objects and the spatial relationship between them. We construct and harness an automated evaluation pipeline that employs computer vision to recognize objects and their spatial relationships, and we employ it in a large-scale evaluation of T2I models. Our experiments reveal a surprising finding that, although recent state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations such as left/right/above/below. Our analyses demonstrate several biases and artifacts of T2I models such as the difficulty with generating multiple objects, a bias towards generating the first object mentioned, spatially inconsistent outputs for equivalent relationships, and a correlation between object co-occurrence and spatial understanding capabilities. We conduct a human study that shows the alignment between VISOR and human judgment about spatial understanding. We offer the SR2D dataset and the VISOR metric to the community in support of T2I spatial reasoning research.
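A VISOR-style check can be pictured as follows: detect both objects, then test whether their centroids satisfy the stated relation. The sketch below is a simplified illustration under that assumption; the official metric and pipeline are defined in the SR2D/VISOR release.

```python
# Minimal sketch of how a VISOR-style check could score one generated image,
# assuming an upstream object detector has produced bounding boxes.
# The official metric may differ; relation names here are illustrative.

def centroid(box):                     # box: (x_min, y_min, x_max, y_max)
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

def relation_holds(box_a, box_b, relation):
    (ax, ay), (bx, by) = centroid(box_a), centroid(box_b)
    if relation == "left of":   return ax < bx
    if relation == "right of":  return ax > bx
    if relation == "above":     return ay < by   # image y grows downward
    if relation == "below":     return ay > by
    raise ValueError(relation)

def visor_score(detections, obj_a, obj_b, relation):
    # Both objects must be detected AND placed as described for credit.
    if obj_a not in detections or obj_b not in detections:
        return 0.0
    return float(relation_holds(detections[obj_a], detections[obj_b], relation))
```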
This paper presents a novel method of deep neural style transfer that generates style images from freeform user text input. The language model and style transfer model form a seamless pipeline that can create output images with similar losses and improved quality compared to baseline style transfer methods. The language model returns a closely matching image given a style text and description input, which is then passed to the style transfer model together with an input content image to create the final output. A proof-of-concept tool is also developed to integrate the models and demonstrate the effectiveness of deep image style transfer from freeform text.
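One plausible realization of this two-stage pipeline is sketched below, under the assumption of a CLIP-based retrieval stage (the paper's actual language model is not named here): retrieve the gallery image best matching the style text, then hand it to any standard style-transfer network.

```python
# Hedged sketch: retrieve a style image matching the user's text with CLIP,
# then feed it to an off-the-shelf style-transfer model. The paper's actual
# language model and transfer network are assumptions here.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def retrieve_style(style_text, candidate_paths):
    text_feat = model.encode_text(clip.tokenize([style_text]).to(device))
    images = torch.stack([preprocess(Image.open(p))
                          for p in candidate_paths]).to(device)
    sims = torch.cosine_similarity(text_feat, model.encode_image(images))
    return candidate_paths[sims.argmax().item()]   # best-matching style image

# style_img = retrieve_style("swirling Van Gogh night sky", gallery_paths)
# output = style_transfer(content_img, style_img)  # any standard NST model
```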
The 1$^{\text{st}}$ Workshop on Maritime Computer Vision (MaCVi) 2023 focused on maritime computer vision for Unmanned Aerial Vehicles (UAV) and Unmanned Surface Vehicle (USV), and organized several subchallenges in this domain: (i) UAV-based Maritime Object Detection, (ii) UAV-based Maritime Object Tracking, (iii) USV-based Maritime Obstacle Segmentation and (iv) USV-based Maritime Obstacle Detection. The subchallenges were based on the SeaDronesSee and MODS benchmarks. This report summarizes the main findings of the individual subchallenges and introduces a new benchmark, called SeaDronesSee Object Detection v2, which extends the previous benchmark by including more classes and footage. We provide statistical and qualitative analyses, and assess trends in the best-performing methodologies of over 130 submissions. The methods are summarized in the appendix. The datasets, evaluation code and the leaderboard are publicly available at https://seadronessee.cs.uni-tuebingen.de/macvi.
Videos often capture objects, their visible properties, their motion, and the interactions between different objects. Objects also have physical properties such as mass, which the imaging pipeline is unable to directly capture. However, these properties can be estimated by utilizing cues from relative object motion and the dynamics introduced by collisions. In this paper, we introduce CRIPP-VQA, a new video question answering dataset for reasoning about the implicit physical properties of objects in a scene. CRIPP-VQA contains videos of objects in motion, annotated with questions that involve counterfactual reasoning about the effect of actions, questions about planning in order to reach a goal, and descriptive questions about visible properties of objects. The CRIPP-VQA test set enables evaluation under several out-of-distribution settings -- videos with objects with masses, coefficients of friction, and initial velocities that are not observed in the training distribution. Our experiments reveal a surprising and significant performance gap in terms of answering questions about implicit properties (the focus of this paper) and explicit properties of objects (the focus of prior work).
Logic obfuscation is introduced as a pivotal defense against multiple hardware threats to integrated circuits (ICs), including reverse engineering (RE) and intellectual property (IP) theft. The effectiveness of logic obfuscation has been challenged by the recently introduced Boolean satisfiability (SAT) attack and its variants. A plethora of countermeasures has also been proposed to thwart the SAT attack. Irrespective of the defense deployed against SAT attacks, large power, performance, and area overheads are inevitable. In contrast, we propose a cognitive solution: SATConda, a neural-network-based UNSAT clause translator that incurs minimal area and power overheads while preserving the original functionality with impenetrable security. SATConda incorporates an UNSAT clause generator that transforms an existing conjunctive normal form (CNF) through minimal perturbations, such as the inclusion of a pair of inverters or buffers, or adds new lightweight UNSAT blocks depending on the provided CNF. For efficient UNSAT clause generation, SATConda is equipped with a multi-layer neural network that first learns the dependencies of features (literals and clauses), followed by a long short-term memory (LSTM) network that validates and back-propagates the SAT-hardness for better learning and translation. Our proposed SATConda is evaluated on the ISCAS85 and ISCAS89 benchmarks and is found to defend against multiple state-of-the-art SAT attacks devised for hardware RE. In addition, we evaluate the empirical performance of SATConda against the MiniSat, Lingeling, and Glucose SAT solvers, which form the basis of many existing deobfuscation SAT attacks.
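For intuition, the sketch below shows one functionality-preserving CNF perturbation of the kind described: routing a signal through an inverter pair so the rewired formula computes the same function over the original variables. It is an illustrative example only, not SATConda's learned translator.

```python
# Illustrative, functionality-preserving CNF perturbation: route a signal x
# through an inverter pair (x -> NOT -> NOT -> z), so z == x, then consume z
# downstream. Variables are DIMACS-style nonzero ints; negative means negated.

def insert_inverter_pair(clauses, x, next_free_var):
    w, z = next_free_var, next_free_var + 1
    # Tseitin encoding of w = NOT x and z = NOT w (hence z == x).
    extra = [[x, w], [-x, -w], [w, z], [-w, -z]]
    # Rewire downstream clauses to consume z instead of x.
    rewired = [[z if lit == x else (-z if lit == -x else lit) for lit in c]
               for c in clauses]
    return rewired + extra, next_free_var + 2

cnf = [[1, -2], [2, 3]]                  # (x1 or not x2) and (x2 or x3)
cnf, free = insert_inverter_pair(cnf, 2, 4)
print(cnf)  # same function over x1..x3, but with two extra variables
```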
Efficient disassembly and recovery of materials from waste electrical and electronic equipment (WEEE) is a critical step in shifting global supply chains from carbon-intensive, mined materials to recycled and renewable ones. Conventional recycling processes rely on shredding and sorting waste streams, but for WEEE composed of many different materials, we explore targeted disassembly of numerous objects to improve material recovery. Many WEEE objects share key features and therefore look very similar, yet their material composition and internal component layout can differ, so an accurate classifier is critical for the subsequent disassembly steps that enable precise material separation and recovery. This work introduces RGB-X, a multi-modal image classification approach, which leverages key features from external RGB images together with those generated from X-ray images to accurately classify electronic objects. More specifically, this work develops iterative class activation mapping (iCAM), a novel network architecture that explicitly focuses on the details in the multi-modal feature maps needed for accurate electronic object classification. For training the classifier, electronic objects lack large, well-annotated X-ray datasets, owing to cost and the need for expert guidance. To overcome this, we propose a novel way of creating a synthetic dataset using domain randomization applied to the X-ray domain. The combined RGB-X approach gives us an accuracy of 98.6% on 10 generations of modern smartphones, against individual accuracies of 89.1% (RGB) and 97.9% (X-ray). We provide experimental results to corroborate our findings.
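As a simplified stand-in for the RGB-X idea, the sketch below fuses features from separate RGB and X-ray backbones by late concatenation. The actual iCAM architecture iterates over class-activation maps, so this baseline is an assumption for illustration only.

```python
# Hedged sketch of a two-branch RGB + X-ray fusion classifier. The backbones,
# late-fusion strategy, and layer sizes are assumptions, not the iCAM network.
import torch
import torch.nn as nn
from torchvision import models

class RGBXClassifier(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.rgb_branch = models.resnet18(weights="DEFAULT")
        self.xray_branch = models.resnet18(weights="DEFAULT")
        feat = self.rgb_branch.fc.in_features          # 512 for resnet18
        self.rgb_branch.fc = nn.Identity()
        self.xray_branch.fc = nn.Identity()
        self.head = nn.Linear(2 * feat, num_classes)   # late fusion by concat

    def forward(self, rgb, xray):                      # both: (B, 3, H, W)
        fused = torch.cat([self.rgb_branch(rgb), self.xray_branch(xray)], dim=1)
        return self.head(fused)
```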
Self-supervised contrastive learning (CL) based pretraining allows robust and generalized deep learning models to be developed with small labeled datasets, reducing the burden of label generation. This paper aims to evaluate the effect of CL-based pretraining on performance in referable vs. non-referable diabetic retinopathy (DR) classification. We have developed a CL-based framework with neural style transfer (NST) augmentation to produce models with better representations and initializations for detecting DR in color fundus images. We compare the performance of the CL-pretrained models against two state-of-the-art baseline models pretrained with ImageNet weights. We further investigate model performance with reduced labeled training data (down to 10%) to test the robustness of training with small labeled datasets. The model was trained and validated on the EyePACS dataset and tested independently on clinical data from the University of Illinois, Chicago (UIC). Compared to the baseline models, our CL-pretrained FundusNet model had higher AUC (CI) values (0.91 (0.898 to 0.930) vs. 0.80 (0.783 to 0.820) and 0.83 (0.801 to 0.853) on UIC data). At 10% labeled training data, the FundusNet AUC was 0.81 (0.78 to 0.84) compared to 0.58 (0.56 to 0.64) and 0.63 (0.60 to 0.66) in the baseline models when tested on the UIC dataset. CL-based pretraining with NST significantly improves DL classification performance, helps the models generalize well (transferable from EyePACS to UIC data), and allows training with small annotated datasets, thereby reducing the clinicians' ground-truth annotation burden.
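The contrastive pretraining step can be sketched as a SimCLR-style objective in which one of the two views comes from an NST augmentation; the nst_augment, encoder, and proj names below are placeholders, not the paper's code.

```python
# Hedged sketch of CL pretraining: two augmented views per fundus image (one
# via a neural-style-transfer augmentation, here a placeholder) pulled
# together with an NT-Xent / SimCLR-style contrastive loss.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    # z1, z2: (B, D) projections of two views of the same B images.
    z = F.normalize(torch.cat([z1, z2]), dim=1)        # (2B, D)
    sim = z @ z.t() / temperature                      # (2B, 2B)
    sim.fill_diagonal_(float("-inf"))                  # exclude self-pairs
    b = z1.size(0)
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)])
    return F.cross_entropy(sim, targets)               # positives: paired views

# Training-loop sketch; nst_augment() stands in for the paper's NST module.
# for images in loader:
#     v1, v2 = standard_augment(images), nst_augment(images)
#     loss = nt_xent(proj(encoder(v1)), proj(encoder(v2)))
#     loss.backward(); opt.step(); opt.zero_grad()
```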